#Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Nov 21 20:35:38 2022"

The text continues here.

Assignment 1: Tasks and Instructions

1. Getting started (3p)

  1. Check that you have everything installed and created according to the instructions. You should have

2. Chapter1.Rmd (5p)

Open the file chapter1.Rmd located in your IODS-project folder with RStudio. Just write some of your thoughts about this course freely in the file, e.g.,

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Nov 21 20:35:38 2022"

1. How are you feeling right now?

I am excited about the course. It is a bit intimidating, since it seems that there is quite a lot of work. But, I cant wait to be able to apply what I will learn here to my own research.

2. What do you expect to learn?

I am a PhD student and I found the course content very fitting for my needs. I am the most excited to learn more about GitHub and R Markdown. Also, the classes regarding model validation (2), clustering/classification (4), and dimensionality reduction techniques (5) are very interesting topics for me, since I am doing my PhD on psychometric validation of (Short) Warwick-Edinburgh Mental Well-being Scale, (S)WEMWBS, among Finnish population.

3. Where did you hear about the course?

I remember taking one of the Kimmo’s course almost 10 years ago, when I was an undergraduate. Even though the course at the time was held in very early morning, I really enjoyed his class. I live in Austarlia, and When I noticed (in Sisu) that he is holding an online course again, I signed up immediately.


Also reflect on your learning experiences with the R for Health Data Science book and the Exercise Set 1:

4. How did it work as a “crash course” on modern R tools and using RStudio?

I have used RStudio before, so I am familiar with the program and I had everything already installed. However, this is the first time when I will be using R Markdown and GitHub.

I also have another statistics course at the moment where will be using R Markdown, so I am excited to learn the syntax and get familiar with the program, along with GitHub, to see how I can use it in my own research. GitHub for example, could work really well, when I have multiple different scripts that I will be testing/editing. I also really like the layout of R Markdown, it is so much easier to follow when you knit it, than normal R script. The interactive features are amazing and can be really cool to add as supplementary material into your manuscript, so people could view different scenarios and examine the topic a bit more deeper.

Also, the R Markdown Tutorial was very helpful. R Markdown Tutorial

5. Which were your favorite topics?

I really like the layout of R Markdown and the Cheat Sheet Cheat Sheet

Also, I found that the R for Health Data Science book very helpful and I know that I will be using it a lot in the future. It seems to have very illustrative examples and code that I can adapt to my own research. I think it will be much more useful when later on we have actual exercises when we need to write our own code. I tend to use Stack Overflow, general Googling, and other peoples code as dictionary or grammar book, when I need to solve some issues with my code. In my opinion you learn the best when you are simultaneously trying to apply the piece of code to solve a problem. Just reading/viewing it is also helpful, but it is hard to grasp all the information at once without a specific task you try to solve.

6. Which topics were most difficult?

I think I have okay understanding of R and RStudio. I know how to “read” and “edit” most of the code, intall and use new packages, etc. The difficult part is when you have an idea what you want to do, and you try to find the best way to edit the code (for example, getting certain colours, divide data based on stratas etc.). Sometimes the packages have different syntax than the “normal” R code, even the syntax in R Markdown is different (e.g., how to mark comments).

However, I found the example code in Exercise1.Rmd very helpful to get started. I would prefer if R for Health Data Science would also have a PDF version, since I prefer to have a copy saved on my personal laptop, so I could highlight and add comments to the text. Also, if I understood it correctly, the book is based on around using the tidyverse-package, since pipe %>% is a part of this package, and would not work if you don’t have tidyverse() installed. There are many ways to write the R code by using different packages and some are using the basic R code and some their own, and sometimes they are mixed. Having a tutorial that would help to understand which syntax you need/can use would be very beneficial.

However, I have not used the GitHub before, so I found it quite difficult to get it started and understand the layout and what things are saved to my personal computer/files and which are online. “Committing” and “Pushing” things to GitHub seemed also quite hard at the start.

I also find it challenging to learn/understand the YAML at the start of R Markdown script, and how to edit them

  • title: “Introduction to Open Data Science, Exercise set 1
  • subtitle: “R basics using the RHDS book
  • output:
  • html_document:
  • theme: flatly
  • highlight: haddock
  • toc: true
  • toc_depth: 2
  • number_section: false

For example, my index.Rmd code did not run at the start and trying to find the ways to fix it was difficult. In the end it just worked even though I did not change anything - I think it was trying to knit the script into something else than html.

Also add in the file a link to your GitHub repository (that you created earlier): https://github.com/your_github_username/IODS-project

Remember to save your chapter1.Rmd file.

3. index.Rmd (2p)

Open the index.Rmd file with RStudio.

At the beginning of the file, in the YAML options below the ‘title’ option, add the following option: author: “Your Name”. Save the file and “knit” the document (there’s a button for that) as an HTML page. This will also update the index.html file.

index.Rmd error code

Error in yaml::yaml.load(…, eval.expr = TRUE) : Parser error: while parsing a block mapping at line 1, column 1 did not find expected key at line 3, column 3 Calls: … parse_yaml_front_matter -> yaml_load -> Execution halted

4.(This point added on Mon 7 Nov 2022) Create a Personal Access Token (PAT) (5 p)

To make the connection between RStudio and GitHub as smooth as possible, you should create a Personal Access Token (PAT).

The shortest way to proceed is to follow the steps below. (Source: https://happygitwithr.com/https-pat.html)

Execute the R commands (preceded by ‘>’) in the RStudio Console (below the Editor):

> install.packages(“usethis”) > usethis::create_github_token()

GitHub website will open in your browser. Log in with your GitHub credentials.

  • Write a Note in the box, for example “IODS Project”.
  • Select an Expiration time for your PAT, e.g., 50 days.
  • The pre-selected scopes “repo”, “workflow”, “gist”, and “user” are OK.
  • Press “Generate token” and
  • copy the generated PAT to your clipboard.

Return to RStudio and continue in the Console:

> gitcreds::gitcreds_set()

  • WAIT until a prompt “Enter password or token:” appears.
  • Paste your PAT to the prompt and press Enter.

Apparently, I already had PAT, but I decided to update it, so I could finish this assignment. Now you should be able to work with GitHub, i.e., push and pull from RStudio.

5. Upload changes to GitHub (5p)

Upload the changes to GitHub (the version control platform) from RStudio. There are a few phases (don’t worry: all this will become an easy routine for you very soon!):

  • First, select the “Git” tab in the upper right corner of RStudio. You will see a list of modified files.
  • Select “Commit”.
  • It will open a new “Review Changes” window showing more detailed information of the changes you have made in each file since the previous version.
  • Tick the box in the front of each file (be patient, it takes some time for the check to appear).
  • Write a small commit message (there’s a box for that) that describes your changes briefly.
  • After this task is completed (not yet), both the changes and the message will be seen on GitHub.

Note: It is useful to make commits often and even on small changes.
Commits are at the heart of the version control system, as a single commit represents a single version of the file.)

  • Press “Commit”.(RStudio uses Git to implement the changes included in the commit.)
  • Press “Push”. (RStudio uses Git to upload the changes to your GitHub repository.)
  • Now you can close the “Review Changes” window of RStudio. Good job!!

5. Check GitHub and Submit Your Assignment

After a few moments, go to your GitHub repository at https://github.com/your_github_username/IODS-project to see what has changed (please be patient and refresh the page).

Also visit your course diary that has been automatically been updated at https://your_github_username.github.io/IODS-project and make sure you see the changes there as well.

After completing the tasks above you are ready to submit your Assignment for the review (using the Moodle Workshop below). Have the two links (your GitHub repository and your course diary) ready!
Remember to get back there when the Review phase begins (see course schedule).

End of Assignment 1: Tasks and Instructions
***

Assignment 2: Tasks and Instructions

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Nov 21 20:35:38 2022"

Data wrangling (max 5 point)

TASK INSTRUCTIONS: Create a folder named ‘data’ in your IODS-project folder. Then create a new R script with RStudio. Write your name, date and a one sentence file description as a comment on the top of the script file. Save the script for example as create_learning2014.R in the data folder. Complete the rest of the steps in that script.

Figure demonstrates how to create a new folder.

Please see create_learning2014.R and lrn14_KS.csv to evaluate the Data wrangling from my a GitHub repository: https://github.com/kiirasar/IODS-project you can find the files in data folder.

Analysis (max 15 points)

First we install/use R packages we need to complete the assignment.

# Select (with mouse or arrow keys) the install.packages("...") and
# run it (by Ctrl+Enter / Cmd+Enter):

# install.packages("GGally")
#install.packages("GGally")
#install.packages("tidyverse")
#install.packages('readr')
#install.packages('ggplot2')
#install.packages("psych")
#install.packages("vtable")
library(vtable)
## Warning: package 'vtable' was built under R version 4.2.2
## Loading required package: kableExtra
## Warning: package 'kableExtra' was built under R version 4.2.2
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ✔ purrr   0.3.4
## Warning: package 'readr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()      masks psych::%+%()
## ✖ ggplot2::alpha()    masks psych::alpha()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag()        masks stats::lag()
library(readr)
library(ggplot2)

1. Read and describe the data (0-2 points)

TASK INSTRUCTIONS: Read the students2014 data into R either from your local folder (if you completed the Data wrangling part) or from this url.

Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Information related to data can be found
here

# Read the data from your local drive using setwd()-command

# setwd('C:\\Users\\Kiira\\Documents\\PhD_SWEMWBS\\PhD Courses\\Courses in 2022\\PHD-302 Open Data Science\\IODS-project')
# lrn14 <- read_csv("data/lrn14_KS.csv")
# head(lrn14) #gender, age, A_att, A_deep, A_stra, A_surf, points
# View(lrn14)

# or from url
std14 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=T) # sep=separator is a comma, header=T

head(std14) #gender, age, attitude, deep, stra, surf, points
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
View(std14)

head() command is used to show the first 6 lines of the datase, whereas View() opens the whole dataset into a new tab.

NOTE. data from local drive is named as lrn14 and data from url as std14.
Read_csv-command worked on R before, but I could not knit the document for some reason. This is why its only there as comments and I decide use the url (std14) dataset. The data is exact same, only the variable names are different. I will use the url data to complete the assignment.

dim(std14)
## [1] 166   7

dim() is R function to explore the dimension of the dataset. The dataset has 166 rows (observations) and 7 columns (variables).You can read the name of the variables or have better look at the data by using head(std14) and View(std14)

str(std14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

str() is R function to explore the structure of the dataset. The dataframe has 166 observations and 7 variables, like in dim().

  • gender is the only character, chr, (categorical) variable. It has 2 different values: F=female and M=male
  • age and points are integral, int, variables, meaning those don’t have any decimals.
  • attitude, deep, stra, and surf are numerical, num, variables, with decimals. These variables are also mean-variables based on specific questions/items that were measuring each learning type.

2. Graphical overview of the data (0-3 points)

TASK INSTRUCTIONS: Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

SUMMARY STATISTICS:
To explore the summaries of each variable I used vtable-package and st()-command, also know as sumtable().
Here is link where you can find more information regarding vtable and st()-command

st(std14) # the command prints a summary statistics table to Viewer-window
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
gender 166
… F 110 66.3%
… M 56 33.7%
age 166 25.512 7.766 17 21 27 55
attitude 166 3.143 0.73 1.4 2.6 3.7 5
deep 166 3.68 0.554 1.583 3.333 4.083 4.917
stra 166 3.121 0.772 1.25 2.625 3.625 5
surf 166 2.787 0.529 1.583 2.417 3.167 4.333
points 166 22.717 5.895 7 19 27.75 33

Dataset std14 has a total of 166 observations (participants) and 7 variables (gender, age, attitude, deep, stra, surf and points). In the dataset:

  • Gender: 66.6% are female (F) and the rest 33.7% male (M)
  • Age: Mean age is 22.5 years-old, with std 7.8. Age range is between 17 and 55.

NOTE. The different learning methods (deep, stra, surf) are average based on several items for each learning method. The summary display the basic descriptive statistics: mean, standard deviation, minim, lower and upper quartiles (25% and 75%) and maximum. The scale among learning techniques are 1-5.

  • Attitude: mean attitude is 3.1 (0.7), with range between 1.4 and 4.9. lower quartile (25%) is 2.6 and higher quartile (75%) 3.7 points.
  • Deep: mean deep learning is 3.7 (0.6), with higher and lower quartiles 3.3 and 4.1, respectively. Range is between 1.6 and 4.9
  • Stra: mean strategic learnign (stra) is 3.1 (0.8), ranging between 1.3-5.0 25% and 75% 2.6 and 3.6, respectively.
  • Surf: mean surface learning (surf) is 2.8 (0.5), range 1.6-4.3. lower quartile (2.4) and higher quartile (3.2)

Points denotes the students exam points in a statistics course exam.

  • Points: mean points were 22.7 (5.9), ranging between 7 and 33 points with lower quartile of 19 and upper 27.75 points.

BARPLOTS - Nominal variables:

I used ggplot-package and barplot to explore the distributions and counts based of gender (nominal)

# ggplot()=command, std14=dataframe, eas(x=variable) + type of plot
gg_gender <- ggplot(std14, aes(x=gender)) + geom_bar() #barplot for nominal variables.
gg_gender

# you can make the plots looking prettier by adding extra code:
ggplot(std14, aes(x=as.factor(gender), fill=as.factor(gender) )) +
  geom_bar(aes(fill=gender)) + 
  geom_text(stat='count',aes(label=..count..),vjust=-0.3) + #Adding counts on top of the bars
  labs(x = "", fill = "gender") + #filling bars based on gender
  ggtitle("Barplot based on gender Learning 2014 dataset") + #adding title
  ylab("count")+ xlab("gender") + #adding x and y labels
   scale_x_discrete(labels=c("F" = "Female", "M" = "Male")) #changing F into female and M into male

According to the previous summary table and barplot dataset std14 has 110 female and 56 male participants.

HISTOGRAMS - Continuous variables:

I made histograms for every continuous variable: age, attitude, deep, stra, surf, and points, in order to check if these are normally distributed - meaning that the distribution follows the bell curve. If variables are not normally distributed, we can’t use parametric statistical approaches e.g., general regression models, but rather non-parametric statistical methods.

NOTE. When making plots, it is important to include everyone. Some people might have difficulties see all the colours e.g., colour blind, so it is imporant to use right colours. On this website you can find inclusive colour pallets.

  • Palette with grey: cbPalette <- c(“#999999”, “#E69F00”, “#56B4E9”, “#009E73”, “#F0E442”, “#0072B2”, “#D55E00”, “#CC79A7”)
  • Palette with black: cbbPalette <- c(“#000000”, “#E69F00”, “#56B4E9”, “#009E73”, “#F0E442”, “#0072B2”, “#D55E00”, “#CC79A7”)

The #CODE are referring to certain colours.

Also, I wanted to print all the plot in one page by using multiplot()-command which is part of ggplot-package. Before using the command I needed to run a code that can be found here

multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
  require(grid)

  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  if (is.null(layout)) {
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                 ncol = cols, nrow = ceiling(numPlots/cols))
}

if (numPlots == 1) {
print(plots[[1]])

} else {
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

for (i in 1:numPlots) {
  matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

  print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                  layout.pos.col = matchidx$col))
 }
}
 }

Then I created a histogram for each continuous variable and used the colours from

  • cbPalette <- c(“#999999”, “#E69F00”, “#56B4E9”, “#009E73”, “#F0E442”, “#0072B2”, “#D55E00”, “#CC79A7”)
# NOTE. you can also use "=" instead of "<-" to create objects. However, ofter "<-" is better, since some packages might use "=" for something else.

p1=ggplot(std14) + 
  geom_histogram(aes(x = age), fill = "#E69F00") +
labs(title="Age")

p2=ggplot(std14) + 
    geom_histogram(aes(x = attitude), fill = "#56B4E9") +
  labs(title="Attitude")

p3=ggplot(std14) + 
    geom_histogram(aes(x = deep), fill = "#009E73")+
  labs(title="Deep learning")

p4=ggplot(std14) + 
    geom_histogram(aes(x = stra), fill = "#F0E442")+
  labs(title="Strategic learning")

p5=ggplot(std14) + 
    geom_histogram(aes(x = surf), fill = "#0072B2")+
  labs(title="Surface learning")

p6=ggplot(std14) + 
  geom_histogram(aes(x = points), fill = "#D55E00")+
  labs(title="Points")

Last, I ran the multiplot()-command.

multiplot(p1, p2, p3, p4, p5, p6, cols=3) #prints 3 columns
## Loading required package: grid
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Summary regarding the results:

  • Variables age and points are using different scale than the other ones (other variables: 1-5)
  • Age: is a bit skewed on left, meaning there are more “younger” than older. If age would be normally distributed, the “peak” should be more in the middle (approx. 30-40 years). Otherwise, all the other variables seem to be normally distributed.

Next, I wanted to create a histogram where I added all the learning strategies on top of each other.

# NOTE. alpha=.5, makes the colours trasparent 50%.

ggplot(std14) + 
  geom_histogram(aes(x = deep), fill = "#009E73", alpha=.5) + # green
  geom_histogram(aes(x = stra), fill = "#F0E442", alpha=.5) + # yellow
  geom_histogram(aes(x = surf), fill = "#0072B2", alpha=.5) + # blue
  labs(title="Learnign strategies", x="Learning strategies (Mean)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Summary regarding the results:

  • You can see that deep learning (green) is a bit more skewed right than the other scales. Similarly, surface learning (blue) is a bit skewed on left.

RELATIONSHIP between the variables:

ggpairs()-comman is part of ggplot and it creates more advanced plot matrix where you can explore the relationships between the variables.

ggpairs(std14, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

The command prints out:

Histograms (first 2 columns on left) based on gender (female, male)

  • more female than male participants
  • all other variables, expect age seem to be more or less normally distributed

Boxplots (first row) based on gender: female (top), male (bottom)

  • Think line illustrates mean, both ends of the boxes are 25% and 75% quartiles. The lines are illustrating confidential intervals and dots are outliers (single participants)
  • Based on the means of each boxplot: females are younger, have lower attitudes and bit lower deep learning scores, but higher strategic and surface learning points than males. There does not seem to be difference based on the means in overall points.
  • However, there does not seem to be statistical difference based on gender (all boxes are align with each other).

Normal distributions (diagonal) only for continuous variables

  • Age seem to be most skewed on left (much younger paricipants than old ones)
  • All the other continuous variables seem to be more or less normally distributed bell-curve)

Correlations (up diagonal) only for continuous variables

  • Only few relationship are statistical significant (p<.001, p<.01, p<.05 three, two and one*, respectively)
    • strategic learning and surface learning (negative correlation; r= -0.161) meaning higher stra is associated with lower surf, and vice versa
    • attitude and surface learning (negative correlation; r= -0.176) meaning higher attitude is associated with lower surf, and vice versa.
    • deep learning and surface learning (negative correlation; r= -0.324) meaning higher deep is associated with lower surf, and vice versa.
    • points and attitude (positive correlation; r= 0.437) meaning higher points is associated with higher attitude, and vice versa.

Scatterplots - Relatinships between continuous variables

  • On the figure above the scatterplots are below the diagonal (normal distribution graphs). Scatterplots illustrate the correlation results and therefore
    • negative correlation = line goes down
    • positive correlation = line goes up
    • no correlation = straight line

However, the figure is quite small, so it is easier to explore the scatterplots by using pairs() command.

# this piece of code excludes the gender (nominal variable)
pairs(std14[-1]) 

But even that is quite ugly. Also, the plots below and above the diagonial line are identical (just opposite scaling). To make the scatterplots nicer, we can create nicer scatterplots with ggplot.

Age scatterplots

# Age
sp1 <- ggplot(std14, aes(x = age, y = attitude)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: age and attitude")


sp2 <- ggplot(std14, aes(x = age, y = deep)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: age and deep learning")

sp3 <- ggplot(std14, aes(x = age, y = stra)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: age and strategic learning")

sp4 <- ggplot(std14, aes(x = age, y = surf)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: age and surface learning")

sp5 <- ggplot(std14, aes(x = age, y = points)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: age and points")


multiplot(sp1, sp2, sp3, sp4, sp5, cols=3)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

# age and attitude, r=0.022
# age and deep learning, r=0.025
# age and strategic learning, r=0.102
# age and surface learning, r=-0.141
# age and points, r=-0.093

Since age is very skewed the correlation between other variables are very low. Meaning the scatterplots and regression line is very flat, indicating low or non-correlation. Age does not seem to be related to different learning techniques, attitudes or overall exam points. However, age was also very skewed meaning a lot of people were same age, that may affect the results.

Attitude scatterplots

## Attitude 
sp6 <- ggplot(std14, aes(x = attitude, y = deep)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: attitude and deep learning")

sp7 <- ggplot(std14, aes(x = attitude, y = stra)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: attitude and strategic learning")

sp8 <- ggplot(std14, aes(x = attitude, y = surf)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: attitude and surface learning")

sp9 <- ggplot(std14, aes(x = attitude, y = points)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: attitude and points")

multiplot(sp6, sp7, sp8, sp9, cols=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

#attitude and deep learning, r=0.110
#attitude and strategic learning, r=0.062
#attitude and surface learning, r= -0.176*
#attitude and points, r=0.437***

The first two graphs (on left column) had very small correlation and the line was very flat.
The relationship between attitude and surface learning (top right) shows a small significant negative correlation (r= -0.176), indicating that the line goes down: people with higher surface learning points, would have higher chance to have lower attitude points as well.
This could mean that people who use surface learning techniques have worsen attitude towards learning in general.

Alternatively, the relationship between attitude and points (down right) show significant positive correlation (r=0.437); the line goes up. Indicating that individuals with high attitude points would often also have high overall points - and vice versa; individual with low attitude would also have low overall points.
One interpretation of these finding is that people who have good attitude towards learning will also success better in their exams.

Deep learning scatterplots

# deep

sp10 <- ggplot(std14, aes(x = deep, y = stra)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: deep and strategic learning")


sp11 <- ggplot(std14, aes(x = deep, y = surf)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: deep and surface learning")

sp12 <- ggplot(std14, aes(x = deep, y = points)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: deep learning and points")

multiplot(sp10, sp11, sp12, cols=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

#deep and strategic learning, r=0.097
#deep and surface learning, r= -0.324***
#deep learning and points, r= -0.010

Only, relationship between deep and surface learning (bottom left) had significant correlation (r= -0.324). The correlation was also negative, meaning that higher deep learning scores were associated with lower surface learning scores and vice versa.

This could mean that people who often use deep learning techniques do rarely use surface learning techiques and vice versa.

Other relationship showed barely any correlation and therefore the line was fairy flat.

Strategic learning

# stra

sp13 <- ggplot(std14, aes(x = stra, y = surf)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: strategic and surface learning")

sp14 <- ggplot(std14, aes(x = stra, y = points)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: strategic learning and points")


sp15 <- ggplot(std14, aes(x = surf, y = points)) +
  geom_point() + #scatterplot
  geom_smooth(method = "lm") + #regression line
  labs(title="Scatterplot: surface learning and points")

multiplot(sp13, sp14, sp15, cols=2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

#strategic and surface learning, r= -0.161*
#strategic learning and points, r=0.146
#surface learning and points, r= -0.144

Only the association between strategic and surface learning showed significant correlation, which was negative (r=-0.161), meaning that lower points were associated with higher surface learning points.
This could mean that people who only use surface learning techniques will struggle to grasp more deeper understanding of different concepts that could lead lower exam points.

Below there is a code and figure with all the scatterplots by using multiplot(), but yet again, the graphs are too small, so the interpretation of the findings is difficult.

multiplot(sp6, sp7, sp8, sp9, sp10, sp11, sp12, sp13, sp14, sp15, cols=4) #prints 5 columns
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'


3. Regression model - outcome variable (0-4 points) &

4. Interpret the model parameters and R-square (0-3 points)

TASK INSTRUCTIONS: Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)

Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)

  • dependent (y): points
  • indepedent (x): deep, strategic and surface learning

I choose these three independent variables, since they are different strategies how people learn.

my_model3 <- lm(points ~ deep + stra + surf, data = std14)
# my_model3 #call the linear model, intercept and slopes.
summary(my_model3) #summary of the model, including the single variable statistical significant summaries
## 
## Call:
## lm(formula = points ~ deep + stra + surf, data = std14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1235  -3.0737   0.5226   4.2799  10.3229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.9143     5.1169   5.260  4.5e-07 ***
## deep         -0.7443     0.8662  -0.859   0.3915    
## stra          0.9878     0.5962   1.657   0.0994 .  
## surf         -1.6296     0.9153  -1.780   0.0769 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared:  0.04071,    Adjusted R-squared:  0.02295 
## F-statistic: 2.292 on 3 and 162 DF,  p-value: 0.08016

Based on the model only strategic and surface learning are statistical significant (. = 0.10), but deep learning is not.

  • Deep learning: Using deep learning techniques seem to have worsen impact on exam points (estimate= -0.7443), meaning the when deep learning increases 1 point, it will lower the exam points by -0.7443 points. This is surprising, since I would have assumed that deep learning methods would be associated with higher exam points. On the other hand, the correlation between deep learning and points were negative, yet not significant. Overall, deep learning is not statistically significant in this model (p-value is greater than 0.1), this learning technique does not seem to explain/predict exam points.
  • Stretegic learning: Using strategic learning techniques has a positive impact on exam points, meaning that when strategic learning increases 1 point, exam points increases by 0.9878 points. This is also statistically significant (p<.10). However, this is not very significant impact. This sort of makes sense, since using strategic learning techniques allows the individual be more adaptable and tehy might use more time to undertsand the overall content instead of going through everything very “deeply”.
  • Surface learning: Using surface learning techniques was negatively associated with exam points. This effect was also significant (p<.01). This means that when surface learning increases by 1 point, it decreases the exam points by -1.6296.

However, the model explain only 4-2.3% of the exam results (Multiple R-squared = 0.04071 and Adjusted R-square = 0.02295). Also the overall p-value of the whole model is relatively bad 0.08016.

Overall, it seems that different learning techniques will pay either none or only little role in explaining overall exam points.

Since, deep learning is not statistically significant, I will remove it from the model and fit the model again without it.

NOTE. Overall, p<.01 is not very good result, normally p<.05 is the level of statistical significant results at least in my research field (psychology).

my_model2 <- lm(points ~ stra + surf, data = std14) #exclude deep learning from the model
summary(my_model2) #summary of the model
## 
## Call:
## lm(formula = points ~ stra + surf, data = std14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4574  -3.2820   0.4296   4.0737   9.8147 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.5635     3.3104   7.118 3.31e-11 ***
## stra          0.9635     0.5950   1.619    0.107    
## surf         -1.3828     0.8684  -1.592    0.113    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.822 on 163 degrees of freedom
## Multiple R-squared:  0.03634,    Adjusted R-squared:  0.02452 
## F-statistic: 3.074 on 2 and 163 DF,  p-value: 0.04895

When excluding the “deep learning” from the model, neither strategic or surface learning remain significant.

As additional task, I wanted to try a completely new model, where attitude, deep and surface learning could try to explain the exam points. I chose these, since they had the highest correlations. However, this might cause multi-collienarity that can impact on our results.

  • dependent (y): points
  • indepedent (x): attitude, deep, and surface learning
my_model32 <- lm(points ~ attitude + deep + surf, data = std14)
summary(my_model32) #summary of the model, including the single variable statistical significant summaries
## 
## Call:
## lm(formula = points ~ attitude + deep + surf, data = std14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9168  -3.1487   0.3667   3.8326  11.3519 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.3551     4.7124   3.895 0.000143 ***
## attitude      3.4661     0.5766   6.011 1.18e-08 ***
## deep         -0.9485     0.7903  -1.200 0.231815    
## surf         -1.0911     0.8360  -1.305 0.193669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1876 
## F-statistic:  13.7 on 3 and 162 DF,  p-value: 5.217e-08

Interestingly, neither deep or surface learning were significant independent variables to predict/explain exam points. However, attitude was highly significant (p<.0001). The results might be caused by the high positive association/correlation between attitude and exam points.

  • Attitude: was statistically significant (p<.0001) and positively associated with points. When attitude increases 1 point, exam points increases by 3.4661 points.
  • Deep learning: even though deep learning was not significant it has a bit higher negative association to exam points in comparison to previous model3. One increase in deep learning worsens the exam points by -0.9485 points.
  • Surface learning: similar to deep learning, surface learning had higher negative, yet non-significant, impact in this model than the previous (model3). For each increase, the exam points lowered -1.0911 points.

Overall, it seem that attitude plays much more bigger role explaining exam results than learning techniques.

The new model is also much better than the previous (model3): R-square was 0.2024, meaning that this model explains 20% variation in exam points. The models p-value was also much better than before: p<.0001

Lastly, I excluded both learning techniques from the model to see if we could increase the model fit.

my_model1 <- lm(points ~ attitude, data = std14)
summary(my_model1) #summary of the model, including the single variable statistical significant summaries
## 
## Call:
## lm(formula = points ~ attitude, data = std14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

This had a little impact:

  • Estimate increased from 3.4661 to 3.5255
  • Multiple R-squared decreased from 0.2024 to 0.1906, this is due to number of indicators (more indicators, better fit)

5. Plots (0-3 points)

TASK INSTRUCTIONS: Produce the following diagnostic plots:

  • Residuals vs Fitted values
  • Normal QQ-plot
  • Residuals vs Leverage.

Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.

FURTHER INSTRUCTIONS based on Exerice 2: R makes it easy to graphically explore the validity of your model assumptions, by using plot()-command e.g., plot(my_model3).
In the plot() function argument which can help you to choose which plots you want. We will focus on plots 1, 2 and 5:

which graphic
1 Residuals vs Fitted values
2 Normal QQ-plot
3 Standardized residuals vs Fitted values
4 Cook’s distances
5 Residuals vs Leverage
6 Cook’s distance vs Leverage

Before the call to the plot() function, add the following: par(mfrow = c(2,2)). This will place the following 4 graphics to the same plot.

RESIDUALS
In general the following graphs are focusing on exploring the residuals of the model.

In statistical point of view, residual is the difference between predicted values of y (dependent variable) and observed values of y . So. Residual = actual y value − predicted y value, (ri=yi−^yi).

Another example explaining residuals is the distance from the linear line. If the observations is located above the linear line, residual is positive and if the observation is located below the line, it is negative.

If our model would explain 100% of the variation of dependent variable, residual would be 0, meaning that all the observations would be touching the linear line.

In a way, you could say that residual is a measure of how well a linear line fits an individual data points.

The picture above is a screenshot from khanacademy.org

R-square on the other hand is calculated a correlation coefficient squared. Or, as well as, the sum of squares of residuals (also called the residual sum of squares, SSres) divided by the total sum of squares (proportional to the variance of the data, SStot) minus 1.

One way to illustrare the SSres is to draw squares between linear line and single data points in a way that the square would touch the linear line. The sum of the squares are SSres.

The picture above is a screenshot from Wikipedia.

Lastly, if the model explains only 5% of the variance of chosen dependent variable (outcome, y), it means that the residuals, everything else except the chosen independent variables (x), are explaining the rest 95% of the variance. Meaning, that we were not able to successfully detect the whole phenomena.

In the final assignment, I will use the first model (model3) as example

  • dependent (y): points
  • indepedent (x): deep, strategic and surface learning

whereas, deep learning is not significant indicator, but strategic and surface learning variables are, but the R^2 is very low (2%)

par(mfrow = c(2,2))
plot(my_model3, which = c(1,2,5))

Residual vs Fitted
This graph illustrates the residuals non-linear patterns. If the red line is roughly horizontal then we can assume that the residuals follow a linear pattern. If the residuals are not linear, we would need to consider non-parametric statistical approaches to investigate the relationship between the variables. Sometimes, there might be relationship between the variables, even though it would not be linear. This plot helps us to detect any other possible non-linear relationship (parabolic/quadratic/polynominal, expotential, steps, etc.)

Based on the graph, in model 3 the line seem to be fairly horizontal, so we can claim that the residuals are following linear patter along with the indicators (dependent, x-variables).
NOTE. This are just “raw” residuals, not standardized

Normal Q-Q
This graph is illustrating if the residuals of the regression model are the normal distributed. In a perfect world, the dots would align with the linear line, indicating that the residuals are in deed normally distributed. Each data point is presented in this picture (dot).

In model 3 it looks like that some observations from the begin and end of the data set are not in line with the linear model. However, overall it is roughly following the line, so we can confirm that the residuals are normally distributed.

Residual vs Leverage This graph is mainly used to spot influential data points, aka outliers, or single data points that could have an impact on our model. This graph can also be used to examine heteroskedasticity (different variance based on different independent variables) which can often be caused by an outlier. We can also investigate non-linearity with this graph.

  • On the y-axis are the standardized residuals. If the data point is located above 2 (or below -2) it can be classified as having a high standardized residual value.
  • The x-axis gives us the leverage value for each point, which you can calculate based on the number of observations (166) and coefficients in the model (3 explanatory variables + the intercept). (coefficient divided by observations) times 2
  • Plot also draws Cook’s Distance; any point that exists outside the 0.5 contour line has a high Cook’s distance.

See more Rummerfield & Berman, 2017, page 3

In our exmaple,

-we have some observations that are below -2 (y-axis) e.g., observations 145, 35, and 19 - The the average leverage is 4/166 ≈ 0.024 then any data point beyond 0.0482 or ≈ 0.05 (2 × 0.024) has a high leverage value. In our model 3 there a some obervations passed 0.5 (x-axis), which we could drop out. - However, our we cant even see the Cook’s distance contour line and these outliers are not too far away from the suggested cut-off lines.

Overall, we can conclude that our data does not have any (or only few) influential data points.


End of Assignment 2: Tasks and Instructions


(more chapters to be added similarly as we proceed with the course!)

date()
## [1] "Mon Nov 21 20:36:16 2022"

Assignment 3: Tasks and Instructions

1. Data wrangling (max 5 points)

Please see create_alc.R to evaluate the Data wrangling from my a GitHub repository: https://github.com/kiirasar/IODS-project you can find the files in data folder.


2. Analysis (max 15 points)

First we install/use R packages we need to complete the assignment.

# Select (with mouse or arrow keys) the install.packages("...") and
# run it (by Ctrl+Enter / Cmd+Enter):

# install.packages("GGally")
#install.packages("GGally")
#install.packages("tidyverse")
#install.packages('readr')
#install.packages('ggplot2')
#install.packages("psych")
#install.packages("vtable")
library(vtable)
library(psych)
library(GGally)
library(tidyverse)
library(readr)
library(ggplot2)
library(tidyr)
library(dplyr)

1. Create a new R Markdown file (chapter3.Rmd) and add this to your index.Rmd file


2. Read and describe the student alc data (0-1 point)

To read the dataset from either my local folder (read_csv()) or from url (reab.table()) use the commands in brackets.

alc_KS <- read_csv("data/create_alc_KS.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(alc_KS) 
## # A tibble: 6 × 35
##   school sex     age address famsize Pstatus  Medu  Fedu Mjob     Fjob    reason
##   <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr>    <chr>   <chr> 
## 1 GP     F        18 U       GT3     A           4     4 at_home  teacher course
## 2 GP     F        17 U       GT3     T           1     1 at_home  other   course
## 3 GP     F        15 U       LE3     T           1     1 at_home  other   other 
## 4 GP     F        15 U       GT3     T           4     2 health   servic… home  
## 5 GP     F        16 U       GT3     T           3     3 other    other   home  
## 6 GP     M        16 U       LE3     T           4     3 services other   reput…
## # … with 24 more variables: guardian <chr>, traveltime <dbl>, studytime <dbl>,
## #   schoolsup <chr>, famsup <chr>, activities <chr>, nursery <chr>,
## #   higher <chr>, internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>,
## #   goout <dbl>, Dalc <dbl>, Walc <dbl>, health <dbl>, failures <dbl>,
## #   paid <chr>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, alc_use <dbl>,
## #   high_use <lgl>
# or from url
alc_a3 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=T) # sep=separator is a comma, header=T

head(alc_a3) # Shows the first 5 rows of the dataset
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime schoolsup famsup activities nursery higher
## 1   mother          2         2       yes     no         no     yes    yes
## 2   father          1         2        no    yes         no      no    yes
## 3   mother          1         2       yes     no         no     yes    yes
## 4   mother          1         3        no    yes        yes     yes    yes
## 5   father          1         2        no    yes         no     yes    yes
## 6   mother          1         2        no    yes        yes     yes    yes
##   internet romantic famrel freetime goout Dalc Walc health failures paid
## 1       no       no      4        3     4    1    1      3        0   no
## 2      yes       no      5        3     3    1    1      3        0   no
## 3      yes       no      4        3     2    2    3      3        2  yes
## 4      yes      yes      3        2     2    1    1      5        0  yes
## 5       no       no      4        3     2    1    2      5        0  yes
## 6      yes       no      5        4     2    1    2      5        0  yes
##   absences G1 G2 G3 alc_use high_use
## 1        5  2  8  8     1.0    FALSE
## 2        3  7  8  8     1.0    FALSE
## 3        8 10 10 11     2.5     TRUE
## 4        1 14 14 14     1.0    FALSE
## 5        2  8 12 12     1.5    FALSE
## 6        8 14 14 14     1.5    FALSE
View(alc_a3) # Preview of the whole data set.

# In this assignment I will be using the dataset from url (alc_a3)
print(colnames(alc_a3)) # print the column names
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
glimpse(alc_a3) # have a bit better look at the data
## Rows: 370
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

Short description of the dataset alc_a3 by using glimpse()-command.

  • The data encompasses student achievement in secondary education of two Portuguese schools.
  • 370 rows (participant, observations, datapoints)
  • 35 columns (variables)
  • first variable is school, last one is high_use
  • 16 integral (int) variables (numeric): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, Dalc, Walc, health, failures, absences, G1, G2, G3
  • 17 charachter (chr) variables (nominal): school, sex, addres, famsize, Pstatus, Mjob, Fjob, reason, guardian, schoolsuo, famsup, activities, nursery, higher, internet, romantic, paid
  • 1 double variable (dbl): alc_use. All real numbers are stored in double precision format.
  • 1 logical (lgl) variable: high_use. Gets value “TRUE” is alc_use is higher than 2.0.

More detailed description of the dataset can be found here and Exercise3.Rmd as well as in Moodle (Assignment 3: Tasks and Instructions)


3. Hypothesis - relationship between high_use and 3 other variables - (0-1 point)

Variables:

  • high_use (alcohol consumption incl. both weekend/weekdays is higher than 2)
  • sex (binary: ‘F’ - female or ‘M’ - male): male has higher alcohol consumption than females. Risk factor
  • absences (number of school absences (numeric: from 0 to 93): positive association, higher absences level is associated with higher alcohol consumption. Risk factor
  • goout (going out with friends; numeric: from 1 - very low to 5 - very high): positive association, going out with friends (higher numbers) is associated with higher alcohol consumption. Risk factor
  • higher (wants to take higher education; binary: yes or no): wanting to take higher education (yes) is associated with lower alcohol consumption. Protective factor

4. Numerical and grapfical display (0-5 points)

High alcohol consumption (high_use: TRUE, FALSE) and sex (FEMALE, MALE): cross-tabulation and barplot

Creating across-tabulation

t0 <- xtabs(~high_use+sex, data=alc_a3)
ftable(t0) # print table.
##          sex   F   M
## high_use            
## FALSE        154 105
## TRUE          41  70
summary(t0) # chi-square test of indepedence. 
## Call: xtabs(formula = ~high_use + sex, data = alc_a3)
## Number of cases in table: 370 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 15.812, df = 1, p-value = 6.996e-05
# or

# http://rstudio-pubs-static.s3.amazonaws.com/6975_c4943349b6174f448104a5513fed59a9.html
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r")
t0_f = crosstab(alc_a3, row.vars = "high_use", col.vars = "sex", type = "f")
t0_f #frequency count
##          sex   F   M Sum
## high_use                
## FALSE        154 105 259
## TRUE          41  70 111
## Sum          195 175 370
t0_c = crosstab(alc_a3, row.vars = "high_use", col.vars = "sex", type = "c")
t0_c # %column
##          sex      F      M
## high_use                  
## FALSE         78.97  60.00
## TRUE          21.03  40.00
## Sum          100.00 100.00

Based on ftable(t0) and t0_f we can see that among females 154 drink below high use (FALSE) and 41 have high use (TRUE). Among males the frequensies are 105 and 74, respectively.

Based on the t0_c it is easier to establish the difference between sex. Male seem indeed use more alcohol (high use 40%) than females (high use 21%).

Based in the summary(0) you can also see that this difference is significant p<.001, x^2(1)=15.81

COLOUR PALETTES

Next we wil create a bar plot based on the results. To do so, we need to creat a new dataframe. Tips can be found here.

When making plots, it is important to include everyone. Some people might have difficulties see all the colours e.g., colour blind, so it is imporant to use right colours. On this website you can find inclusive colour pallets.

  • Palette with grey: cbPalette <- c(“#999999”, “#E69F00”, “#56B4E9”, “#009E73”, “#F0E442”, “#0072B2”, “#D55E00”, “#CC79A7”)
  • Palette with black: cbbPalette <- c(“#000000”, “#E69F00”, “#56B4E9”, “#009E73”, “#F0E442”, “#0072B2”, “#D55E00”, “#CC79A7”)
group_col=c("#E69F00", "#D55E00", "#56B4E9", "#0072B2") #saving colour-blind "safe" colours

Creating a dataframe

# data frame based on frequencies
df = data.frame(group=c("Female Low", "Male Low", "Female High", "Male High"),
                value=c(154, 105, 41, 70))
df
##         group value
## 1  Female Low   154
## 2    Male Low   105
## 3 Female High    41
## 4   Male High    70
# data frame based on column %
df_p = data.frame(group=c("Female Low", "Male Low", "Female High", "Male High"),
                percentage=c(79, 60, 21, 40))
df_p
##         group percentage
## 1  Female Low         79
## 2    Male Low         60
## 3 Female High         21
## 4   Male High         40

Creating a plot - sex and alcohol consumption (high_use)

#based on frequency
ggplot(df, aes(x=group, y=value, fill=group)) + #basic plot
  geom_bar(stat="identity") + #define a plot and put groups are side-by-side
  geom_text(aes(label=value), vjust=-0.3, size=3.5) + #add frequencies
  scale_fill_manual(values=c("#E69F00", "#D55E00", "#56B4E9", "#0072B2")) #add colours

#based on column%
ggplot(df_p, aes(x=group, y=percentage, fill=group)) + #basic plot
  geom_bar(stat="identity") + #define a plot and put groups are side-by-side
  geom_text(aes(label=percentage), vjust=-0.3, size=3.5) + #add frequencies
  scale_fill_manual(values=c("#E69F00", "#D55E00", "#56B4E9", "#0072B2")) #add colours

Based on the plot you can also easily to see that based on the column% female participants consumed less alcohol (low=79%, high=21%) in comparison to male students (low=60%, high=40%). This result support our previous hypothesis: male has higher alcohol consumption than females. In other words, being a mae might be arisk factor when considering alcohol consumption among students.

High alcohol consumption (high_use) and absences: summary and boxplot

Use tapply()-command to search the basic summary of absences-variable.

tapply(alc_a3$absences, alc_a3$high_use, summary)
## $`FALSE`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    3.00    3.71    5.00   45.00 
## 
## $`TRUE`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   4.000   6.378   9.000  44.000

Absence seem to be higher among students with high acohol consumption habits: Mean=6.4, Mdn=4.0, similarly both the quartilies (1 and 3) and maximum were higher than for those who had low alcohol consumption habits. However, tapply() command does not test the statistical significancy, nor give SD values or confidential intervals (C.I.), so it is hard to make conclusion that this difference would be statistically significant.

Next, we initialize a boxplot of high_use and absences

g1 <- ggplot(alc_a3, aes(x = high_use, y = absences, fill=high_use)) + #alcohol consumption with absences (numeric)
  geom_boxplot() + ylab("Absences") + xlab("High alcohol use (more than 2)") +
  scale_fill_manual(values=c("#D55E00", "#0072B2")) #add colours
g1

Again, you see that the mean and both quarterlies are higher for TRUE values than fro FALSE values. However, the C.I.s are align with each other (lines) indicating that even though there is a difference in absences, it would not be statistically significant. However, excluding some of the outliers, might change the results. It seems like that there is one participant (datapoint) whose drinking habits are low (FALSE), but they have over 40 absences. Similar can be found in TRUE-values (high drinking) as well. In other words, it is unlikely that abseces from school is a risk factor when considering alcohol consumption among students.

High alcohol consumption (high_use) and going out with friends (goout: numeric 1=very low to 5=very high): cross-tabulation and barplot

Creating a cross-tabulation

# Cross-tabulation
t1 <- xtabs(~high_use+goout, data=alc_a3)
ftable(t1) # print table.
##          goout  1  2  3  4  5
## high_use                     
## FALSE          19 82 97 40 21
## TRUE            3 15 23 38 32
summary(t1) # chi-square test of indepedence. 
## Call: xtabs(formula = ~high_use + goout, data = alc_a3)
## Number of cases in table: 370 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 55.57, df = 4, p-value = 2.463e-11
# or http://rstudio-pubs-static.s3.amazonaws.com/6975_c4943349b6174f448104a5513fed59a9.html
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r")
t1_f = crosstab(alc_a3, row.vars = "high_use", col.vars = "goout", type = "f")
t1_f #frequency count
##          goout   1   2   3   4   5 Sum
## high_use                              
## FALSE           19  82  97  40  21 259
## TRUE             3  15  23  38  32 111
## Sum             22  97 120  78  53 370
t1_c = crosstab(alc_a3, row.vars = "high_use", col.vars = "goout", type = "c")
t1_c # %column
##          goout      1      2      3      4      5
## high_use                                         
## FALSE           86.36  84.54  80.83  51.28  39.62
## TRUE            13.64  15.46  19.17  48.72  60.38
## Sum            100.00 100.00 100.00 100.00 100.00

Based in the summary(0) you can also see that there is a significant difference p<.001, x^2(4)=55.57 between going out with friends and alcohol consumption. However, the chi-square test does not tell where this significant difference is located. However, based on the col% we can guess that the difference in alcohol consumption is higher among students who see their friends more often (scoring 4 or 5: F = 51% and 40%; T = 49% and 60%) in comparison students who dont see their friends that often (scoring 1-3: F ~ 80%; T ~ 20%).

Creating a dataframe

# data frame based on frequencies
df_p2 = data.frame(group=c("1 False", "2 False", "3 False", "4 False", "5 False",
                          "1 True", "2 True", "3 True", "4 True", "5 True"),
                percentage=c(86, 85, 81, 51, 40,
                             14, 15, 19, 49, 60))
df_p2
##      group percentage
## 1  1 False         86
## 2  2 False         85
## 3  3 False         81
## 4  4 False         51
## 5  5 False         40
## 6   1 True         14
## 7   2 True         15
## 8   3 True         19
## 9   4 True         49
## 10  5 True         60

Creating a plot - going out with friens (goout) and alcohol consumption (high_use)

#based on column%
ggplot(df_p2, aes(x=group, y=percentage, fill=group)) + #basic plot
  geom_bar(stat="identity") + #define a plot and put groups are side-by-side
  geom_text(aes(label=percentage), vjust=-0.3, size=3.5) + #add frequencies
  scale_fill_manual(values=c("#E69F00", "#D55E00", "#E69F00", "#D55E00", "#E69F00", "#D55E00","#56B4E9", "#0072B2", "#56B4E9", "#0072B2")) #add colours

Based on the plot you can also easily to see that based on the column% studeny who don’t see their friends that often will often have less prevelance of high alcohol consumption, whereas the prevelance for high alcohol consumption seem to be ~50%-50%, or even higher 60%-40% among students who see their friends often. These result support our previous hypothesis: going out with friends (higher incidence) is associated with higher risk of high alcohol consumption.

High alcohol consumption (high_use) and wanting to go take higher education after graduation (higher: YES, NO): cross-tabulation and barplot

Creating a cross-tabulation

# Cross-tabulation
t2 <- xtabs(~high_use+higher, data=alc_a3)
ftable(t2) # print table.
##          higher  no yes
## high_use               
## FALSE             7 252
## TRUE              9 102
summary(t2) # chi-square test of indepedence. 
## Call: xtabs(formula = ~high_use + higher, data = alc_a3)
## Number of cases in table: 370 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 5.487, df = 1, p-value = 0.01916
##  Chi-squared approximation may be incorrect
# or http://rstudio-pubs-static.s3.amazonaws.com/6975_c4943349b6174f448104a5513fed59a9.html
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r")
t2_f = crosstab(alc_a3, row.vars = "high_use", col.vars = "higher", type = "f")
t2_f #frequency count
##          higher  no yes Sum
## high_use                   
## FALSE             7 252 259
## TRUE              9 102 111
## Sum              16 354 370
t2_c = crosstab(alc_a3, row.vars = "high_use", col.vars = "higher", type = "c")
t2_c # %column
##          higher     no    yes
## high_use                     
## FALSE            43.75  71.19
## TRUE             56.25  28.81
## Sum             100.00 100.00

Based in the summary(0) you can see that there is not a significant difference p<.02, x^2(1)=5.487 between alcohol consumption and wanthing to go higher education. There might be a small trend though. Also, the variable is very skewed, since only few students don’t want to continue their studies (n=16).

Creating a dataframe

# data frame based on frequencies
df_3 = data.frame(group=c("No False", "No True", "Yes False", "Yes True"),
                value=c(7, 9, 252, 111))
df_3
##       group value
## 1  No False     7
## 2   No True     9
## 3 Yes False   252
## 4  Yes True   111
# data frame based on col%
df_p3 = data.frame(group=c("No False", "No True", "Yes False", "Yes True"),
                percentage=c(44, 56, 71, 29))
df_p3
##       group percentage
## 1  No False         44
## 2   No True         56
## 3 Yes False         71
## 4  Yes True         29

NOTE.

  • No False = Not wanting to continue higher education, low alcohol consumption
  • No True = Not wanting to continue higher education, high alcohol consumption
  • Yes False = Wanting to continue higher education, low alcohol consumption
  • Yes True = Wanting to continue higher education, high alcohol consumption

Creating a plot - wanthing to continue higher education (higher) and alcohol consumption (high_use)

#based on column%
ggplot(df_p3, aes(x=group, y=percentage, fill=group)) + #basic plot
  geom_bar(stat="identity") + #define a plot and put groups are side-by-side
  geom_text(aes(label=percentage), vjust=-0.3, size=3.5) + #add frequencies
  scale_fill_manual(values=c("#E69F00", "#D55E00", "#56B4E9", "#0072B2")) #add colours

Based on the plot you can see that the alcohol consumption is quite even (F = 44%, TRUE = 56%) between students who don’t want to continue higher education, yet a bit over half have high alcohol consumption. Majority (71%) of students who want to continue their studies reported low alcohol consumption, and 29% high alcohol use. However, since the data is skewed we might also want to print the frequency plot.

#based on frequencies
ggplot(df_3, aes(x=group, y=value, fill=group)) + #basic plot
  geom_bar(stat="identity") + #define a plot and put groups are side-by-side
  geom_text(aes(label=value), vjust=-0.3, size=3.5) + #add frequencies
  scale_fill_manual(values=c("#E69F00", "#D55E00", "#56B4E9", "#0072B2")) #add colours

In this example it might be smarter to present the frequence plot, since otherwise people might miss-interpret the plot.

These result support sort of our previous hypothesis: wanting to take higher education is associated with lower alcohol consumption. However, this interpretation is not very sientific. Better interpretation is to establish that there is not enough data indicating this would be the case, since only 16 participants did not want to continue their studies.


5. Logistic regression (0-5 points)

  • Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.
  • Present and interpret a summary of the fitted model.
  • Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them.
  • Interpret the results and compare them to your previously stated hypothesis.

Hint: If your model includes factor variables see for example the RHDS book or the first answer of this stackexchange thread on how R treats and how you should interpret these variables in the model output (or use some other resource to study this).

This model predicts high alcohol consumption (more than 2). Because, the variable is not continuous, but binary (FALSE, TRUE) we need to use general linear model or mixed-model, where we specify the model as “binomial”.

The models variables are:

  • sexM indicates that male is the reference group
  • absences is continuous model
  • goout is actually ordinal variable (1=very low, 5=very high). The model treats its sort of continous variable and therefore it does not have reference group
  • higheryes indicates that yes is reference group
m_0 <- glm(high_use ~ sex + absences + goout + higher, data = alc_a3, family = "binomial") # glm()
summary(m_0) # you can also get the whole summary of the model using summary()-command
## 
## Call:
## glm(formula = high_use ~ sex + absences + goout + higher, family = "binomial", 
##     data = alc_a3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7787  -0.8120  -0.5286   0.7990   2.4772  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.67317    0.77959  -4.712 2.46e-06 ***
## sexM         0.99134    0.26216   3.781 0.000156 ***
## absences     0.08279    0.02289   3.617 0.000298 ***
## goout        0.72110    0.12093   5.963 2.48e-09 ***
## higheryes   -0.48263    0.59294  -0.814 0.415674    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 373.43  on 365  degrees of freedom
## AIC: 383.43
## 
## Number of Fisher Scoring iterations: 4
OR_0 <- coef(m_0) %>% exp # compute odds ratios (OR)
CI_0 <- confint(m_0) # compute confidence intervals (CI)
## Waiting for profiling to be done...
cbind(OR_0, CI_0) # print out the odds ratios with their confidence intervals
##                   OR_0       2.5 %     97.5 %
## (Intercept) 0.02539593 -5.24084803 -2.1674521
## sexM        2.69484233  0.48359815  1.5136772
## absences    1.08631589  0.03959962  0.1303793
## goout       2.05669840  0.48977814  0.9649756
## higheryes   0.61716027 -1.68255125  0.6719255
  • 1.0 = exposure between groups are the same - no difference
  • greter than 1.0 = reference group > other group(s) - risk factor
  • less than 1.0 = reference group < other group(s) - protective factor

Every other pedictor, except higher, is statistically significant p<.001.

sex:

  • Since male is the reference group and the estimate is positive it means that being a male male increases the log odds of having higher alcohol consumption by 0.99.
  • The odds ratio, OR is 2.70 (greter than 1) meaning that male have higher risk for higher alcohol consumption than females.
  • Because the confidence intervals, CIs are not between 0 and 1, we can conclude that this is statistically significant. Similar to summary(m_0) already pointed out p<.001
  • In other words, being a male is associated having higher risk of higher alcohol consumption than females.
  • This results supports our hypothesis that male has higher alcohol consumption than females. Male gender can be considered as a risk factor when predicting high alcohol use among students.

Absences:

  • Since this variable is continuous, each one-unit change in absences (more absences) will increase the log odds for having higher alcohol consumption by 0.08.
  • The OR is 1.09 (greater than 1), meaning that higher amount of absences from school is increasinf the risk of higher alcohol use.
  • Since C.Is are not between 0 and 1, this is statistically significant, like model already stated p<.001
  • In other words, higher amount of school absences is associated with having higher alcohol consumption. School absences can be considered as a risk factor when predicting high alcohol use among students.

Going out with friends (goout):

  • Since this variable is treated as “continous”, each one-unit change in going out with friends (going out often) will increase the log odds of having higher alcohol consumption by 0.72.
  • The OR is 2.06 (greater than 1), indicating that the more often student spent time with their friends it will increase the risk of higher alcohol use.
  • The CIs are not between 0 and 1, so this variable is significant predicting higher alcohol use. The model states the same thing p<.001
  • In other words, if student is going out with friends more often they have higher risk to use more alcohol than others. Going out with friends can be considered as a risk factor when predicting high alcohol use among students.

Wanting to go higher education (higher):

  • Since, yes is reference group and the estimate is negative it indicates that wanting to continue to higher education will decrease the log odds of having higher alcohol consumption by -0.48.
  • The OR is 0.61, so below 1, indicating that the risk for higher alcohol use is lower among students who want to continue to higher education after graduation than for those who don’t.
  • However, the CIs are between 0 and 1 (2.5% = -1.7 and 97.5 = 0.67), indicating that future plans of continue their studies is not associated with higher alcohol use.
  • In other words, wanting to continue studies to higher education may protect student to use less alcohol. Wanting to continue to higher education could be considered as a proective factor when predicting high alcohol use among students. But, this variable was non-significant.

NOTE. The difference between Null and Residual deviances tells us the model fit. Greater difference, better fit. However, this is arbitrary. Odds ratios in general:

The results, except higher-variable, are in line with our hypothesis.


6. Predictive power of the m_0 (0-3 points)

  • Compare the performance of the model with performance achieved by some simple guessing strategy.

2x2 cross tabulation of predictions versus the actual values

m1 <- glm(high_use ~ sex + absences + goout, data = alc_a3, family = "binomial") #drop the higher variable from the model
alc_a3 <- mutate(alc_a3, probability = predict(m1, type = "response"))
alc_a3 <- mutate(alc_a3, prediction = probability > 0.5)

select(alc_a3, high_use, sex, absences, goout, probability, prediction) %>% tail(10) #sanity check.
##     high_use sex absences goout probability prediction
## 361    FALSE   M        3     3  0.32716276      FALSE
## 362    FALSE   M        0     2  0.15403200      FALSE
## 363     TRUE   M        7     3  0.40566343      FALSE
## 364    FALSE   F        1     3  0.12866204      FALSE
## 365    FALSE   F        6     3  0.18408147      FALSE
## 366    FALSE   F        2     2  0.07202493      FALSE
## 367    FALSE   F        2     4  0.24971608      FALSE
## 368    FALSE   F        3     1  0.03919794      FALSE
## 369     TRUE   M        4     5  0.69415157       TRUE
## 370     TRUE   M        2     1  0.09434569      FALSE

The probability indicates how well our model fits in single datapoints. If the model predicts the datapoint well (over 0.5 probability) it gets value TRUE (prediction).

Graphic visualizing of actual values and the predictions.

ggplot(alc_a3, aes(x = high_use, y = probability)) +
  geom_point(size=2, aes(colour=prediction))

The plot illustrates the probability that single data points are succecfully (in blue) predicted in our model (probability is over 0.50). Based on the plot, you can see that we are actually missing fair bit of data points (in red) which our model fails to explain.

table(high_use = alc_a3$high_use, prediction = alc_a3$prediction) %>%  # tabulate the target variable versus the predictions
 prop.table() %>%  # explore probabilities.
 addmargins() #add margins.
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65405405 0.04594595 0.70000000
##    TRUE  0.16486486 0.13513514 0.30000000
##    Sum   0.81891892 0.18108108 1.00000000

Based on the first cross-tabultaion table, we have 67 datapoints that the model was able to successfully predict (prediction = TRUE; 17+50=67) and 303 that it was not able to predict (prediction = FALSE; 242+61=303) with our model.

We can calculate the success and error rates as following:

  • success rate: 67/370 ~ 0.18 ~ 18%
  • error rate: 303/370 ~ 0.82 ~ 82%

Specifically, it seems that our model had difficulties predict the “low use alcohol”-category, only 7%.

  • high_use = FALSE: 242 + 17 = 259; 242/259 = 0.93 = 93%; 100% - 93% = 7%
  • high_use = TRUE: 61 + 50 = 111; 61/111 = 0.55 = 55%; 100% - 55% = 45%

Based on the second cross-tabluation table prop.table() %>%, we get the same table, but portions. Similarly ~ 19% was successfully predicted based on our model (5% + 14%) and ~ 81% unsuccessful (65% + 16%).

Based on the third cross-tabulation table addmargins() calculates the row, column and over all %, supporting the previous explenations: 81% error rate and 18% success rate.

Total proportion of inaccurately classified individuals training error First, run this code.

loss_func <- function(class, prob) { # define a loss function (mean prediction error)
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# calculates the average wrong predictions of high_use
loss_func(class = alc_a3$high_use, prob = 0)   #0.3 
## [1] 0.3
loss_func(class = alc_a3$high_use, prob = 0.2) #0.3 
## [1] 0.3
loss_func(class = alc_a3$high_use, prob = 0.5) #0 
## [1] 0
loss_func(class = alc_a3$high_use, prob = 0.7) #0.7 
## [1] 0.7
loss_func(class = alc_a3$high_use, prob = 1)   #0.7
## [1] 0.7
loss_func(class = alc_a3$high_use, prob = alc_a3$probability) #0.2108108
## [1] 0.2108108

Unfortunately, I don’t know what those differences comes from. I assume that the total proportion of inaccurate classifies individuals (=training error) is the code, where the prob=0. If this is the case I would assume that our model would miss-classify 30% of the data points. But yet again, this is just a hunch not actually knowledge. Or maybe its the opposite, where prob=1, and our model would miss-classifdy 70%, which is more closer to the cross-tabulation and plot (~82% vs ~18%)


7. Bonus: 10-fold cross-validation (0-2 points to compensate loss points from above)

  • Perform 10-fold cross-validation on your model. ** add (K=10)**
  • Does your model have better test set performance (smaller prediction error using 10-fold cross-validation) compared to the model introduced in the Exercise Set (which had about 0.26 error). Could you find such a model?
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5 # loss function (average prediction error) more than 50%
  mean(n_wrong)
}
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
## 
##     logit
# This is the Exercise3 model.
library(readr)
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
library(dplyr)
m <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
cv$delta[1] # 0.2405405
## [1] 0.2405405
# My model
loss_func(class = alc_a3$high_use, prob = 0) #0.3; average of wrong predictions
## [1] 0.3
#My model with K-folding
cv_a3 <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m1, K = 10) # K-fold cross-validation, with 10-fold cross-validation

# average number of wrong predictions in the cross validation
cv_a3$delta[1] #0.2108108
## [1] 0.2216216
  • The first model (Exercise3) prediction error is 0.24
  • The second model (my model without folding) prediction error is 0.3
  • The last model (my model with folding) prediction error is 0.21

My model has a bit better (0.21<0.24) test set than the Exercise3 example


8. Super-Bonus: cross-validation to different model (0-4 points)

  • Perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors).
  • Start with a very high number of predictors and explore the changes in the training and testing errors as you move to models with less predictors.
  • Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.

Variables:

  • Dependent = high_use (alcohol consumption incl. both weekend/weekdays is higher than 2)

Individual predictors (model_a):

  • sex (binary: (F) female or (M) male)
  • age student’s age (numeric: from 15 to 22)

Family predictors (model_b):

  • Pstatus parent’s cohabitation status (T) living together or (A) apart
  • Medu mother’s education (numeric: 0 none, 1 primary education, 2, 3 or 4 higher education)
  • Fedu father’s education (numeric: 0 none, 1 primary education, 2, 3 or 4 higher education)
  • famsup family educational support (binary: yes or no)

Relationship predictors (model_c):

  • romatic with a romantic relationship (binary: yes or no)
  • famrel quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  • goout going out with friends; numeric: from 1 - very low to 5 - very high
m_e1 <- glm(high_use ~ sex + age + Pstatus + Medu + Fedu + famsup + romantic + famrel + goout, data = alc_a3, family = "binomial") # glm()
summary(m_e1) # you can also get the whole summary of the model using summary()-command
## 
## Call:
## glm(formula = high_use ~ sex + age + Pstatus + Medu + Fedu + 
##     famsup + romantic + famrel + goout, family = "binomial", 
##     data = alc_a3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5034  -0.7839  -0.5248   0.8328   2.7038  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.08144    1.99332  -2.048  0.04060 *  
## sexM         1.00902    0.26233   3.846  0.00012 ***
## age          0.13919    0.11148   1.249  0.21181    
## PstatusT    -0.16469    0.41922  -0.393  0.69444    
## Medu        -0.13851    0.15388  -0.900  0.36806    
## Fedu         0.06658    0.15400   0.432  0.66551    
## famsupyes    0.02742    0.27038   0.101  0.91923    
## romanticyes -0.26041    0.27990  -0.930  0.35218    
## famrel      -0.45258    0.14096  -3.211  0.00132 ** 
## goout        0.78842    0.12423   6.347  2.2e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 375.84  on 360  degrees of freedom
## AIC: 395.84
## 
## Number of Fisher Scoring iterations: 4
m_e2 <- glm(high_use ~ Pstatus + Medu + Fedu + famsup + romantic + famrel + goout, data = alc_a3, family = "binomial") # glm()
summary(m_e2) # you can also get the whole summary of the model using summary()-command
## 
## Call:
## glm(formula = high_use ~ Pstatus + Medu + Fedu + famsup + romantic + 
##     famrel + goout, family = "binomial", data = alc_a3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7118  -0.7869  -0.5605   0.9787   2.4585  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.50647    0.80400  -1.874  0.06097 .  
## PstatusT    -0.11758    0.40944  -0.287  0.77398    
## Medu        -0.10248    0.14972  -0.684  0.49368    
## Fedu         0.05997    0.14909   0.402  0.68750    
## famsupyes   -0.17237    0.25814  -0.668  0.50430    
## romanticyes -0.25006    0.26880  -0.930  0.35222    
## famrel      -0.39998    0.13538  -2.954  0.00313 ** 
## goout        0.80180    0.12025   6.668 2.59e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 392.67  on 362  degrees of freedom
## AIC: 408.67
## 
## Number of Fisher Scoring iterations: 4
m_e3 <- glm(high_use ~ romantic + famrel + goout, data = alc_a3, family = "binomial") # glm()
summary(m_e3) # you can also get the whole summary of the model using summary()-command
## 
## Call:
## glm(formula = high_use ~ romantic + famrel + goout, family = "binomial", 
##     data = alc_a3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5967  -0.7778  -0.5440   0.9544   2.4264  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.8394     0.6253  -2.941  0.00327 ** 
## romanticyes  -0.2582     0.2684  -0.962  0.33607    
## famrel       -0.3969     0.1350  -2.939  0.00329 ** 
## goout         0.7954     0.1195   6.655 2.83e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 393.71  on 366  degrees of freedom
## AIC: 401.71
## 
## Number of Fisher Scoring iterations: 4
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5 # loss function (average prediction error) more than 50%
  mean(n_wrong)
}
cv_e1_test <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e1, K = nrow(alc_a3))
cv_e1_test$delta[1]
## [1] 0.2324324
cv_e1 <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e1, K = 10)
cv_e1$delta[1]
## [1] 0.2243243
cv_e2_test <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e2, K = nrow(alc_a3))
cv_e2_test$delta[1]
## [1] 0.2594595
cv_e2 <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e2, K = 10)
cv_e2$delta[1]
## [1] 0.2621622
cv_e3_test <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e3, nrow(alc_a3))
cv_e3_test$delta[1]
## [1] 0.2513514
cv_e3 <- cv.glm(data = alc_a3, cost = loss_func, glmfit = m_e3, K = 10)
cv_e3$delta[1]
## [1] 0.2513514
e1d_test=cv_e1_test$delta[1]
e1d=cv_e1$delta[1]
e2d_test=cv_e2_test$delta[1]
e2d=cv_e2$delta[1]
e3d_test=cv_e3_test$delta[1]
e3d=cv_e3$delta[1]

deltas_test=c(e1d_test, e2d_test, e3d_test)
deltas_test
## [1] 0.2324324 0.2594595 0.2513514
deltas_train=c(e1d, e2d, e3d)
deltas_train
## [1] 0.2243243 0.2621622 0.2513514
deltas=c(e1d_test, e2d_test, e3d_test,e1d, e2d, e3d)
deltas
## [1] 0.2324324 0.2594595 0.2513514 0.2243243 0.2621622 0.2513514
group_e1=c("testing", "testing","testing", "training", "training", "training")

label_names=c("all test", "family test", "relationship test", 
              "all train", "family train", "relationship train")

superbonus1 = data.frame(x=c(1:3),
                         deltas,
                         group_e1,
                         label_names)

library(ggrepel)
ggplot(superbonus1, aes(x = x, y = deltas, color = group_e1, group = group_e1)) + 
  geom_point() + geom_line() +
   geom_label_repel(aes(label = label_names),
                  box.padding   = 0.35, 
                  point.padding = 0.5,
                  segment.color = 'grey50') 

Unsure if this is the correct one. I found this link that I found useful.


End of assignment 3.